Llama-3.2-3B DPO/RLHF Fine-Tuning
License: MIT
This model is a fine-tuned version of Llama-3.2-3B-Instruct trained with Direct Preference Optimization (DPO). It is intended for preference-based answer ranking and reward-modeling tasks, as well as general language understanding and instruction-following response generation.
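For reference, DPO optimizes the policy directly on preference pairs rather than training a separate reward model (Rafailov et al., 2023); the standard objective is:

$$
\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]
$$

where $y_w$ and $y_l$ are the preferred and rejected responses for prompt $x$, $\pi_{\text{ref}}$ is the frozen reference policy (typically the base model before DPO, here presumably Llama-3.2-3B-Instruct), and $\beta$ controls how far the trained policy may drift from the reference.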
Tags: Large Language Model · English
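Below is a minimal inference sketch using the standard `transformers` causal-LM API. The repository id is a placeholder (the card does not state one), and the generation parameters are illustrative assumptions, not a tested configuration.

```python
# Minimal inference sketch; repo id and generation settings are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/llama-3.2-3b-dpo"  # hypothetical repo id; replace with the actual one

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumes a GPU with bf16 support
    device_map="auto",           # requires the `accelerate` package
)

# Llama-3.2-Instruct derivatives ship a chat template, so we format the
# prompt through apply_chat_template rather than raw text.
messages = [
    {"role": "user", "content": "Explain Direct Preference Optimization in one paragraph."}
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.7)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```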